A Bayesian Spatial Scan Statistic for Under-reported Data

August 14, 2025

Table of Contents

  • Introduction
  • Proposed Method
  • Simulation Study
  • Application: Texas COVID-19 Data
  • Discussion and Conclusion

Introduction

Public Health Surveillance

Public health surveillance
The systematic, ongoing assessment of the health of a community including the timely collection, analysis, interpretation, dissemination and subsequent use of data. 1

Outbreak Detection

A subset of disease surveillance methods focuses on disease progression and outbreak detection.

Novel disease monitoring

New diseases often lack reliable testing and reporting systems. Early cases may be missed or misclassified, which undermines disease surveillance techniques that assume complete case counts.

Examples

  • COVID-19
  • HIV/AIDS
  • Tuberculosis (TB)

Accounting for Under-reporting

Most methods proposed for modeling under-reported or misclassified data fall into two categories:

  1. Double sampling
  2. Latent variable models

Spatial Scan Statistics

General Concept

Scan statistics

  1. Select candidate regions
  2. Calculate relative risk inside and outside of candidate region
  3. Determine region with largest difference
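
Step 1 leaves the choice of candidate regions open. One common option is to approximate circular zones by growing each area's set of nearest neighbours. The following is a minimal Python sketch under that assumption; the centroid coordinates and the `max_size` cap are hypothetical, and the source does not state which zone family it uses.

```python
import math

def candidate_zones(centroids, max_size):
    """Build candidate zones as each area's 1..max_size nearest neighbours.

    centroids: list of (x, y) tuples, one per area (hypothetical inputs).
    Returns a set of frozensets of area indices, with duplicates removed.
    """
    zones = set()
    for ci in centroids:
        # Order all areas by distance from the current centre (itself first).
        order = sorted(range(len(centroids)),
                       key=lambda j: math.dist(ci, centroids[j]))
        for k in range(1, max_size + 1):
            zones.add(frozenset(order[:k]))
    return zones

# Three areas on a line, zones of at most two areas.
example_zones = candidate_zones([(0, 0), (1, 0), (3, 0)], max_size=2)
```

Capping the zone size keeps the number of candidates linear in the number of areas, which matters because every candidate is evaluated in steps 2 and 3.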

Visualization of Spatial Scan Regions

Frequentist Spatial Scan Statistic

  • The framework assumes that we observe counts \(z_i\) such that \(z_i \sim \text{Poisson}(qb_i)\)
    • Where \(b_i\) represents the known baseline/at risk population of cell \(S_i\)
    • \(q\) is the unknown underlying disease rate

\[ H_0: \text{No cluster (common rate for all regions)} \\ H_1(S): \text{Cluster in subset }S\text{ with elevated rate vs. outside } S \]

  • Compute likelihood ratio test statistic for each candidate zone \(S\)
  • The scan statistic is \(\Lambda = \max_{S \in C}\lambda(S)\), where \(\lambda(S)\) is the likelihood ratio for zone \(S\) and \(C\) is the set of candidate zones.
  • Generate Monte Carlo samples under \(H_0\) to calculate the p-value
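
The procedure above can be sketched in Python. The slides do not spell out \(\lambda(S)\), so this sketch assumes the standard Poisson likelihood ratio of Kulldorff (1997), which is nonzero on the log scale only when the rate inside the zone exceeds the rate outside. The Monte Carlo step assumes null data sets are drawn by redistributing the observed total across cells in proportion to \(b_i\). Function names and the toy data are illustrative only.

```python
import math
import random

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def log_lr(z, b, zone):
    """Log likelihood ratio for one zone; 0 unless the inside rate is elevated."""
    c_all, b_all = sum(z), sum(b)
    c_in = sum(z[i] for i in zone)
    b_in = sum(b[i] for i in zone)
    c_out, b_out = c_all - c_in, b_all - b_in
    if b_out <= 0 or c_in / b_in <= c_out / b_out:
        return 0.0
    return (xlogy(c_in, c_in / b_in) + xlogy(c_out, c_out / b_out)
            - xlogy(c_all, c_all / b_all))

def scan(z, b, zones):
    """Return (max log LR, best zone) over the candidate zones."""
    return max((log_lr(z, b, s), tuple(s)) for s in zones)

def mc_p_value(z, b, zones, n_sim=99, seed=0):
    """Monte Carlo p-value: rank of the observed statistic among null replicates."""
    rng = random.Random(seed)
    observed, _ = scan(z, b, zones)
    cells = range(len(b))
    exceed = 0
    for _ in range(n_sim):
        # Null replicate: same total count, allocated proportionally to baseline.
        sim = [0] * len(b)
        for i in rng.choices(cells, weights=b, k=sum(z)):
            sim[i] += 1
        if scan(sim, b, zones)[0] >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)
```

Working on the log scale avoids overflow for large counts, and the `+ 1` in the p-value counts the observed data set as one of the replicates.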

Bayesian Spatial Scan Statistics

  • Assume we observe count data \(z_i\) in area \(i\), each associated with a baseline \(b_i\)
  • Under the null hypothesis there is no cluster and all locations share a common rate \(q_{all}\) \[ z_i \sim \text{Poisson}(q_{all} b_i), \quad q_{all} \sim \text{Gamma}(\alpha_{all}, \beta_{all})\]
  • The alternative hypothesis \(H_1(S)\) for each candidate cluster \(S \in \mathcal{S}\), where \(\mathcal{S}\) is the set of all candidate clusters, is \[ \begin{cases} z_i \sim \text{Poisson}(q_{in} b_i), &i \in S, \quad q_{in} \sim \text{Gamma}(\alpha_{in}, \beta_{in}), \\ z_i \sim \text{Poisson}(q_{out} b_i), &i \notin S, \quad q_{out} \sim \text{Gamma}(\alpha_{out}, \beta_{out}). \end{cases} \]
  • Marginal likelihoods are based on the gamma-Poisson model
  • The model is conjugate, so the marginal likelihoods have closed-form solutions
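
As an illustration of the closed form, integrating \(q\) out of a Poisson likelihood with a Gamma prior gives \(P(D) = \prod_i \frac{b_i^{z_i}}{z_i!} \cdot \frac{\beta^{\alpha}}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha + Z)}{(\beta + B)^{\alpha + Z}}\), with \(Z = \sum_i z_i\) and \(B = \sum_i b_i\). The sketch below assumes the Gamma prior is parameterized by shape \(\alpha\) and rate \(\beta\), which the slides do not state; the function names are illustrative.

```python
import math

def log_gamma_poisson_marginal(z, b, alpha, beta):
    """Log marginal likelihood of counts z with baselines b.

    Assumes z_i ~ Poisson(q * b_i) and q ~ Gamma(shape=alpha, rate=beta).
    """
    total_z, total_b = sum(z), sum(b)
    data_term = sum(zi * math.log(bi) - math.lgamma(zi + 1)
                    for zi, bi in zip(z, b))
    return (data_term + alpha * math.log(beta) - math.lgamma(alpha)
            + math.lgamma(alpha + total_z)
            - (alpha + total_z) * math.log(beta + total_b))

def log_marginal_h1(z, b, zone, prior_in, prior_out):
    """Log marginal under H1(S): independent rates inside and outside the zone."""
    inside = set(zone)
    z_in = [zi for i, zi in enumerate(z) if i in inside]
    b_in = [bi for i, bi in enumerate(b) if i in inside]
    z_out = [zi for i, zi in enumerate(z) if i not in inside]
    b_out = [bi for i, bi in enumerate(b) if i not in inside]
    return (log_gamma_poisson_marginal(z_in, b_in, *prior_in)
            + log_gamma_poisson_marginal(z_out, b_out, *prior_out))
```

Because the inside and outside rates have independent priors, the marginal likelihood under \(H_1(S)\) factorizes into the two closed-form pieces.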

Bayesian Spatial Scan Statistic Testing

  • Using the marginal likelihoods from the models, the posterior probability of the null is \[P(H_0 | D) = \frac{P(D|H_0) P(H_0)}{P(D)}\]
  • The posterior probability of the alternative for candidate cluster \(S\) is \[P(H_1(S) | D) = \frac{P(D|H_1(S)) P(H_1(S))}{P(D)}\] where \(P(D) = P(D|H_0) P(H_0) + \sum_{S} P(D|H_1(S)) P(H_1(S))\)
  • Then we can return regions with non-negligible posterior probabilities
  • Since we have the full posterior probability distributions there is no need for randomization testing
  • Bayes factors can be used to provide a direct measure of evidence for one hypothesis over the other
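
The normalization over hypotheses can be sketched as follows, working with log marginal likelihoods for numerical stability. The prior split is an assumption made for illustration: a hypothetical total prior mass `prior_h1_total` for "some cluster exists", divided equally among candidate clusters. The slides do not specify the prior over hypotheses.

```python
import math

def posterior_probabilities(log_ml_h0, log_ml_h1, prior_h1_total=0.05):
    """Return (P(H0 | D), [P(H1(S) | D) for each candidate S]).

    log_ml_h0: log P(D | H0).
    log_ml_h1: list of log P(D | H1(S)), one per candidate cluster.
    prior_h1_total: hypothetical prior mass shared equally by the candidates.
    """
    n = len(log_ml_h1)
    log_joint = [log_ml_h0 + math.log(1.0 - prior_h1_total)]
    log_joint += [lm + math.log(prior_h1_total / n) for lm in log_ml_h1]
    # log-sum-exp gives log P(D) without overflow or underflow.
    m = max(log_joint)
    log_evidence = m + math.log(sum(math.exp(v - m) for v in log_joint))
    post = [math.exp(v - log_evidence) for v in log_joint]
    return post[0], post[1:]
```

Candidates can then be ranked by posterior probability, and those above a chosen cutoff reported.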

Bayesian Interpretation

Interpretation of Bayes factor

| BF | Log(BF) | Strength of evidence against \(H_0\) |
|---|---|---|
| 1 to 3.2 | 0 to 1.16 | Not Significant |
| 3.2 to 10 | 1.16 to 2.30 | Positive |
| 10 to 100 | 2.30 to 4.61 | Strong |
| \(>\) 100 | \(> 4.61\) | Decisive |
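
A small helper can map a Bayes factor onto the categories in the table above. This is only a sketch: how boundary values (exactly 3.2, 10, or 100) are assigned and the label for \(BF < 1\) are assumptions, since the table does not cover them.

```python
import math

def evidence_strength(bf):
    """Map a Bayes factor (H1 over H0) to the table's evidence categories."""
    if bf < 1:
        return "Favors H0"  # hypothetical label; the table starts at BF = 1
    if bf <= 3.2:
        return "Not Significant"
    if bf <= 10:
        return "Positive"
    if bf <= 100:
        return "Strong"
    return "Decisive"

def evidence_strength_from_log(log_bf):
    """Same mapping for a natural-log Bayes factor."""
    return evidence_strength(math.exp(log_bf))
```

The Log(BF) column corresponds to natural logarithms, for example \(\ln 10 \approx 2.30\).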

Scan Statistic Timeline

timeline
    title Spatial Scan Statistic Development
    1965 : Conceptual basis - Naus
    1997 : Basic Spatial Scan Statistic (Frequentist)
    1998 : Space-Time Extension (Frequentist)
    2005 : Flexible Shapes (Frequentist)
    2005 : Bayesian Spatial Scan Statistic
    2007 : Multivariate Spatial Scan Statistic (Frequentist)
    2012 : Overdispersed data extension (Frequentist) 
    2017 : Bayesian Spatial Scan Statistic for Zero-inflated count data
    2018 : Wald-based Spatial Scan Statistics (Frequentist)
    2024 : Bayesian Spatial Scan Statistic for Multinormal data

  • Since their formalization in 1997, spatial scan statistics have been used and described as a method for epidemiologists
  • No existing extension accounts for under-reported count data

Proposed Method

Model

  • We propose a novel Bayesian spatial scan statistic model by modeling the true counts as a latent variable and introducing reporting probability \(p\).
  • Our spatial scan statistic is based on the hierarchical model \[ z_i \sim \text{Poisson}(p \times q \times b_i) \\ q \sim \text{Gamma}(\alpha, \beta) \\ p \sim \text{Beta}(\alpha_p, \beta_p) \]
  • The model is no longer conjugate

Bayesian Spatial Scan Statistic Extension

  • The new null hypothesis assumes no clusters \[ z_i \sim \text{Poisson}(p \times q_{all} \times b_i), \quad q_{all} \sim \text{Gamma}(\alpha_{all}, \beta_{all}), \quad p \sim \text{Beta}(\alpha_p, \beta_p) \]
  • The resulting alternative hypothesis for candidate region \(S\) is \[ \begin{cases} z_i \sim \text{Poisson}(p \times q_{in} \times b_i), &i \in S, \quad q_{in} \sim \text{Gamma}(\alpha_{in}, \beta_{in}), \\ z_i \sim \text{Poisson}(p \times q_{out} \times b_i), &i \notin S, \quad q_{out} \sim \text{Gamma}(\alpha_{out}, \beta_{out}). \end{cases} \\ p \sim \text{Beta}(\alpha_p, \beta_p) \]

Setting Priors

  • Necessary to set an informative prior on reporting rate \(p\)
    • Historical Data
    • Expert elicitation
  • Can set a diffuse prior on the \(q\) parameters

Posterior Estimation

  • The marginal likelihood under a candidate region \(S\) is now: \[P(D|H_1(S)) = \int \int \int P(D|q_{in}, q_{out}, p) \times \\ \quad \pi(q_{in}) \times \pi(q_{out}) \times \pi(p) \, dq_{in} \, dq_{out} \, dp\]
  • Posterior samples are obtained through MCMC sampling in Stan
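
For intuition about this integral, a cross-check that does not use MCMC is possible: conditional on \(p\), the rates \(q_{in}\) and \(q_{out}\) can still be integrated out analytically under the Gamma priors, which leaves a one-dimensional integral over \(p\). The sketch below is not the Stan and bridge-sampling workflow described in these slides. It swaps in simple midpoint quadrature over \(p\), assumes shape and rate Gamma parameters, and uses illustrative names and toy data.

```python
import math

def log_cond_marginal(z, b, p, alpha, beta):
    """log P(D | p) with q integrated out (Gamma prior, shape alpha, rate beta)."""
    total_z, total_b = sum(z), sum(b)
    data_term = sum(zi * math.log(bi) - math.lgamma(zi + 1)
                    for zi, bi in zip(z, b))
    return (data_term + total_z * math.log(p)
            + alpha * math.log(beta) - math.lgamma(alpha)
            + math.lgamma(alpha + total_z)
            - (alpha + total_z) * math.log(beta + p * total_b))

def log_beta_pdf(p, a, b):
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(p) + (b - 1) * math.log(1 - p) - log_norm

def log_marginal_underreported(groups, prior_p, n_grid=2000):
    """Midpoint-rule estimate of the log marginal likelihood.

    groups: list of (z, b, alpha, beta); one tuple under H0, two (inside and
    outside the candidate region) under H1(S). All groups share the same p.
    prior_p: (a, b) parameters of the Beta prior on p.
    """
    a, b_p = prior_p
    logs = []
    for k in range(n_grid):
        p = (k + 0.5) / n_grid
        logs.append(log_beta_pdf(p, a, b_p)
                    + sum(log_cond_marginal(z, b, p, al, be)
                          for z, b, al, be in groups))
    m = max(logs)  # log-sum-exp over grid points for stability
    return m + math.log(sum(math.exp(v - m) for v in logs) / n_grid)

# Toy data: the last area forms the candidate region S.
z, b = [3, 2, 12], [10.0, 10.0, 10.0]
log_h0 = log_marginal_underreported([(z, b, 2.0, 0.5)], (3.5, 23.0))
log_h1 = log_marginal_underreported(
    [(z[2:], b[2:], 2.0, 0.5), (z[:2], b[:2], 2.0, 0.4)], (3.5, 23.0))
log_bf = log_h1 - log_h0  # log Bayes factor for H1(S) over H0
```

A one-dimensional quadrature like this is cheap enough to serve as a sanity check on a sampled estimate for small examples.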

Decision Making

  • Decisions should be based on the estimated risk ratio between the inside and the outside of the candidate cluster. \[ \widehat{RR} = \frac{\widehat{q_{in}}}{\widehat{q_{out}}} \]
  • Bayes factors provide evidence for the alternative hypothesis
    • Calculated using bridge sampling in R (Gronau, Singmann, and Wagenmakers, n.d.)
  • The most likely cluster is selected based on the largest risk ratio and Bayes factor

Simulation Study

Simulation Design

  • 39 counties of Washington state, with a simulated outbreak in 3 counties in southeastern Washington
  • Baseline values were determined by 100,000 total cases to start
  • 50 simulated data sets for each set of parameters
    • Reporting rate: 0.1, 0.2, 0.3, 0.4, and 0.5
    • Outbreak effect (\(\Delta = q_{in} - q_{out}\)): 0.15, 0.20, 0.25, 0.30, 1.0, and 3.01
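
One way to generate a data set of this kind is sketched below. It draws true counts from a Poisson distribution and then thins them with the reporting probability, which yields reported counts that are marginally \(\text{Poisson}(p \, q \, b_i)\). The baselines, the value of \(q_{out}\), and the cluster indices are hypothetical placeholders and not the values used in the study.

```python
import random

def poisson(lam, rng):
    """Exact Poisson draw: count unit-rate exponential arrivals before time lam."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate(baseline, cluster, q_out, delta, p, rng):
    """Return (true counts, reported counts) for one simulated data set.

    baseline: list of b_i values (hypothetical here).
    cluster: set of area indices with elevated rate q_in = q_out + delta.
    p: reporting probability applied independently to each true case.
    """
    true_counts, reported = [], []
    for i, b_i in enumerate(baseline):
        q_i = q_out + delta if i in cluster else q_out
        y_i = poisson(q_i * b_i, rng)
        # Binomial thinning: each true case is reported with probability p.
        z_i = sum(rng.random() < p for _ in range(y_i))
        true_counts.append(y_i)
        reported.append(z_i)
    return true_counts, reported

rng = random.Random(2025)
y, z = simulate([50.0] * 39, {0, 1, 2}, q_out=1.0, delta=3.0, p=0.3, rng=rng)
```

Keeping the true counts alongside the reported ones makes it possible to compare a method fitted to `z` against the complete data `y`.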

Simulation priors

  • Priors for reporting rate \(p\) were set using the betabuster tool in the epiR package \[ p \sim \text{Beta}(3.5, 23) \quad \text{if} \quad p = 0.1 \\ p \sim \text{Beta}(4.5, 15) \quad \text{if} \quad p = 0.2 \\ p \sim \text{Beta}(10, 22) \quad \text{if} \quad p = 0.3 \\ p \sim \text{Beta}(13, 19) \quad \text{if} \quad p = 0.4 \\ p \sim \text{Beta}(1, 1) \quad \text{if} \quad p = 0.5 \]
  • Priors for \(q\) \[ q_{all} \sim \text{Gamma}(2, 0.5) \\ q_{out} \sim \text{Gamma}(2, 0.4) \\ q_{in} \sim \text{Gamma}(2, 0.5) \]
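
A quick check suggests the listed Beta priors place their modes at, or very near, the true reporting rates, which would be consistent with a mode-based elicitation. The sketch below computes the mean and mode of each prior; treating the mode as the elicited quantity is an inference from the numbers, not something the slides state.

```python
def beta_mean_mode(a, b):
    """Return (mean, mode) of Beta(a, b); the mode is None unless a, b > 1."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None
    return mean, mode

# Reporting rate -> Beta prior parameters, as listed above.
simulation_priors = {0.1: (3.5, 23), 0.2: (4.5, 15), 0.3: (10, 22),
                     0.4: (13, 19), 0.5: (1, 1)}
summaries = {rate: beta_mean_mode(a, b)
             for rate, (a, b) in simulation_priors.items()}
```

The \(\text{Beta}(1, 1)\) prior used for \(p = 0.5\) is uniform on \((0, 1)\), so it has mean 0.5 but no unique mode.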

Simulation Metrics

Even when the null hypothesis is correctly rejected, the detected clusters rarely match the true cluster exactly.

To evaluate how well the detected and true clusters overlap, we use:

  • Power: Proportion of detected clusters that exactly match the true cluster
  • Sensitivity: Proportion of true cases correctly included in the detected cluster
  • Positive Predictive Value (PPV): Proportion of detected cases that are actually in the true cluster
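
These metrics can be computed from sets of area indices, as in the sketch below. It counts areas with equal weight, which is an assumption: the definitions above could also be read as weighting by cases or population. The function names are illustrative.

```python
def overlap_metrics(detected, true):
    """Return (exact_match, sensitivity, ppv) for one detected cluster."""
    detected, true = set(detected), set(true)
    hits = detected & true
    sensitivity = len(hits) / len(true)
    ppv = len(hits) / len(detected) if detected else 0.0
    return detected == true, sensitivity, ppv

def summarize(detections, true):
    """Average the metrics over simulated data sets; the mean of exact matches is the power."""
    rows = [overlap_metrics(d, true) for d in detections]
    n = len(rows)
    return {"power": sum(r[0] for r in rows) / n,
            "sensitivity": sum(r[1] for r in rows) / n,
            "ppv": sum(r[2] for r in rows) / n}

# Two hypothetical simulation runs against a true cluster {2, 3, 5}.
summary = summarize([{2, 3, 5}, {1, 2, 3, 4}], true={2, 3, 5})
```

In this toy example the first run matches exactly and the second recovers two of the three true areas among four detected ones.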

Simulation Results Visual

Application: Texas COVID-19 Data

Texas COVID-19 Data

  • COVID-19 data in early 2020 were severely under-reported due to limited testing and difficulty of diagnosis (Hortaçsu, Liu, and Schwieg 2021)
  • Data (254 Counties)
    • COVID-19 cases (Probable and Confirmed)
    • Population

Real Data (priors)

  • Estimates from early COVID-19 studies suggest very low reporting rates (\(\approx 10\%\)), with low probability of exceeding \(30\%\) (Chen, Song, and Stamey 2022).
  • This information results in a prior of \(p \sim \text{Beta}(7, 55)\)
  • Diffuse priors were set on the \(q_\cdot\) parameters

\[ q_{all} \sim \text{Gamma}(1, 0.1) \\ q_{out} \sim \text{Gamma}(1, 0.1) \\ q_{in} \sim \text{Gamma}(1, 0.1) \]
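
The \(\text{Beta}(7, 55)\) prior can be checked against the stated beliefs with a few lines of Python. The sketch computes the prior mode and approximates the tail probability \(P(p > 0.3)\) by midpoint quadrature; the grid size and function names are arbitrary choices made for illustration.

```python
import math

def log_beta_pdf(p, a, b):
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(p) + (b - 1) * math.log(1 - p) - log_norm

def beta_tail(a, b, threshold, n_grid=20000):
    """Approximate P(p > threshold) with the midpoint rule on (threshold, 1)."""
    width = (1.0 - threshold) / n_grid
    return width * sum(
        math.exp(log_beta_pdf(threshold + (k + 0.5) * width, a, b))
        for k in range(n_grid))

a, b = 7, 55
prior_mode = (a - 1) / (a + b - 2)      # most likely reporting rate a priori
prior_tail = beta_tail(a, b, 0.30)      # prior probability that p exceeds 30%
```

The mode equals 0.1, matching the roughly 10% reporting rate, and the tail probability comes out small, in line with the stated low probability of exceeding 30%.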

Real Data Results

The two methods identify different most likely clusters:

  • Naive: Around the city of Houston
  • Under-reported: Around El Paso and north of Dallas-Fort Worth (DFW)

The Bayes factor for each identified cluster is very large, indicating strong evidence in favor of \(H_1\) over \(H_0\).

Discussion

  • Traditional scan statistics may fail when case counts are under-reported, which is common in emerging outbreaks
  • The proposed method models the reporting probability, improving cluster detection under incomplete data
  • Comparison with confirmed cases suggests that some true clusters (Texas Panhandle) remain undetected, indicating that further refinement is needed

Future work

  • Extend to spatiotemporal model for real-time detection
  • Incorporate multivariate outcomes
  • Allow spatially varying reporting rates to reflect local testing access

Bibliography

Chen, Jinjie, Joon Jin Song, and James D. Stamey. 2022. “A Bayesian Hierarchical Spatial Model to Correct for Misreporting in Count Data: Application to State-Level COVID-19 Data in the United States.” International Journal of Environmental Research and Public Health 19 (6): 3327. https://doi.org/10.3390/ijerph19063327.
Gronau, Quentin F, Henrik Singmann, and Eric-Jan Wagenmakers. n.d. “Bridgesampling: An R Package for Estimating Normalizing Constants.”
Hortaçsu, Ali, Jiarui Liu, and Timothy Schwieg. 2021. “Estimating the Fraction of Unreported Infections in Epidemics with a Known Epicenter: An Application to COVID-19.” Journal of Econometrics, Pandemic Econometrics, 220 (1): 106–29. https://doi.org/10.1016/j.jeconom.2020.07.047.
Kulldorff, Martin. 1997. “A Spatial Scan Statistic.” Communications in Statistics - Theory and Methods 26 (6): 1481–96. https://doi.org/10.1080/03610929708831995.
Kulldorff, Martin, Farzad Mostashari, Luiz Duczmal, W. Katherine Yih, Ken Kleinman, and Richard Platt. 2007. “Multivariate Scan Statistics for Disease Surveillance.” Statistics in Medicine 26 (8): 1824–33. https://doi.org/10.1002/sim.2818.
Neill, Daniel B. 2024. “Bayesian Scan Statistics.” In Handbook of Scan Statistics, edited by Joseph Glaz and Markos V. Koutras, 83–103. New York, NY: Springer. https://doi.org/10.1007/978-1-4614-8033-4_28.
Neill, Daniel, Andrew Moore, and Gregory Cooper. 2005. “A Bayesian Spatial Scan Statistic.” In Advances in Neural Information Processing Systems. Vol. 18. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2005/hash/28acfe2da49d2b9a7f177458256f2540-Abstract.html.
Shao, Kan, Yandong Liu, and Daniel B. Neill. 2011. “A Generalized Fast Subset Sums Framework for Bayesian Event Detection.” In 2011 IEEE 11th International Conference on Data Mining, 617–25. https://doi.org/10.1109/ICDM.2011.11.